{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Week8 bonus descriptions\n",
    "\n",
    "Here are some cool mini-projects you can try to dive deeper into the topic.\n",
    "\n",
    "## More metrics: BLEU (5+ pts)\n",
    "\n",
    "Pick BLEU or any other relevant metric, e.g. BLEU (e.g. from `nltk.bleu_score`).\n",
    "* Train model to maximize BLEU directly\n",
    "* How does levenshtein behave when maximizing BLEU and vice versa?\n",
    "* Compare this with how they behave when optimizing likelihood. \n",
    "\n",
    "(use default parameters for bleu: 4-gram, uniform weights)\n",
    "\n",
    "## Actor-critic (5+++ pts)\n",
    "\n",
    "While self-critical training provides a large reduction of gradient variance, it has a few drawbacks:\n",
    "- It requires a lot of additional computation during training\n",
    "- It doesn't adjust V(s) between decoder steps. (one value per sequence)\n",
    "\n",
    "There's a more general way of doing the same thing: learned baselines, also known as __advantage actor-critic__.\n",
    "\n",
    "There are two main ways to apply that:\n",
    "- __naive way__: compute V(s) once per training example.\n",
    "  - This only requires additional 1-unit linear dense layer that grows out of encoder, estimating V(s)\n",
    "  - (implement this to get main points)\n",
    "- __every step__: compute V(s) on each decoder step\n",
    "  - Again it's just an 1-unit dense layer (no nonlinearity), but this time it's inside decoder recurrence.\n",
    "  - (+3 pts additional for this guy)\n",
    "\n",
    "In both cases, you should train V(s) to minimize squared error $(V(s) - R(s,a))^2$ with R being actual levenshtein.\n",
    "You can then use $ A(s,a) = (R(s,a) - const(V(s))) $ for policy gradient.\n",
    "\n",
    "There's also one particularly interesting approach (+5 additional pts):\n",
    "- __combining SCST and actor-critic__:\n",
    "  - compute baseline $V(s)$ via self-critical sequence training (just like in main assignment)\n",
    "  - learn correction $ C(s,a_{:t}) = R(s,a) - V(s) $ by minimizing $(R(s,a) - V(s) - C(s,a_{:t}))^2 $\n",
    "  - use $ A(s,a_{:t}) = R(s,a) - V(s) - const(C(s,a_{:t})) $\n",
    "\n",
    "\n",
    "\n",
    "## Implement attention (5+++ pts)\n",
    "\n",
    "Some seq2seq tasks can benefit from the attention mechanism. In addition to taking the _last_ time-step of encoder hidden state, we can allow decoder to peek on any time-step of his choice.\n",
    "\n",
    "![img](https://xiandong79.github.io/downloads/nmt-model-fast.gif)\n",
    "\n",
    "\n",
    "#### Recommended steps:\n",
    "__1)__ Modify encoder-decoder\n",
    "\n",
    "Learn to feed the entire encoder into the decoder. You can do so by sending encoder rnn layer directly into decoder (make sure there's no `only_return_final=True` for encoder rnn layer).\n",
    "\n",
    "```\n",
    "class decoder:\n",
    "    ...\n",
    "    encoder_rnn_input = InputLayer(encoder.rnn.output_shape, name='encoder rnn input for decoder')\n",
    "    ...\n",
    "    \n",
    "#decoder Recurrence\n",
    "rec = Recurrence(...,\n",
    "                 input_nonsequences = {decoder.encoder_rnn_input: encoder.rnn},\n",
    "                 )\n",
    "\n",
    "```\n",
    "\n",
    "For starters, you can take it's last tick (via SliceLayer) inside the decoder step and feed it as input to make sure it works.\n",
    "\n",
    "__2)__ Implement attention mechanism\n",
    "\n",
    "Next thing we'll need is to implement the math of attention.\n",
    "\n",
    "The simplest way to do so is to write a special layer. We gave you a prototype and some tests below.\n",
    "\n",
    "__3)__ Use attention inside decoder\n",
    "\n",
    "That's almost it! Now use `AttentionLayer` inside the decoder and feed it to back to lstm/gru/rnn (see code demo below).\n",
    "\n",
    "Train the full network just like you did before attention.\n",
    "\n",
    "__More points__ will be awwarded for comparing learning results of attention Vs no attention.\n",
    "\n",
    "__Bonus bonus:__ visualize attention vectors (>= +3 points)\n",
    "\n",
    "The best way to make sure your attention actually works is to visualize it.\n",
    "\n",
    "A simple way to do so is to obtain attention vectors from each tick (values __right after softmax__, not the layer outputs) and drawing those as images.\n",
    "\n",
    "#### step-by-step guide:\n",
    "- split AttentionLayer into two layers: _\"from start to softmax\"_ and _\"from softmax to output\"_\n",
    "- add outputs of the first layer to recurrence's `tracked_outputs`\n",
    "- compile a function that computes them\n",
    "- plt.imshow(them)\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np",
    "\n",
    "import theano",
    "\n",
    "import lasagne",
    "\n",
    "import theano.tensor as T",
    "\n",
    "from lasagne import init",
    "\n",
    "from lasagne.layers import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class AttentionLayer(MergeLayer):",
    "\n",
    "    def __init__(self, decoder_h, encoder_rnn):",
    "\n",
    "        # sanity checks",
    "\n",
    "        assert len(",
    "\n",
    "            decoder_h.output_shape) == 2, \"please feed decoder 1 step activation as first param \"",
    "\n",
    "        assert len(",
    "\n",
    "            encoder_rnn.output_shape) == 3, \"please feed full encoder rnn sequence as second param\"",
    "\n",
    "\n",
    "        self.decoder_num_units = decoder_h.output_shape[-1]",
    "\n",
    "        self.encoder_num_units = encoder.output_shape[-1]",
    "\n",
    "\n",
    "        # Here you should initialize all trainable parameters.",
    "\n",
    "        #",
    "\n",
    "\n",
    "        # use this syntax:",
    "\n",
    "        self.add_param(spec=init.Normal(std=0.01),  # or other initializer",
    "\n",
    "                       shape= < shape tuple > ,",
    "\n",
    "                       name='<param name here>')",
    "\n",
    "\n",
    "        MergeLayer.__init__(self, [decoder_h, encoder_rnn], name=\"attention\")",
    "\n",
    "\n",
    "    def get_output_shape_for(self, input_shapes, **kwargs):",
    "\n",
    "        \"\"\"return matrix of shape [batch_size, encoder num units]\"\"\"",
    "\n",
    "        return (None, self.encoder_num_units)",
    "\n",
    "\n",
    "    def get_output_for(self, inputs, **kwargs):",
    "\n",
    "        \"\"\"",
    "\n",
    "        takes (decoder_h, encoder_seq)",
    "\n",
    "        decoder_h has shape [batch_size, decoder num_units]",
    "\n",
    "        encoder_seq has shape [batch_size, sequence_length, encoder num_units]",
    "\n",
    "\n",
    "        returns attention output: matrix of shape [batch_size, encoder num units]",
    "\n",
    "\n",
    "        please read comments carefully before you start implementing",
    "\n",
    "        \"\"\"",
    "\n",
    "        decoder_h, encoder_seq = inputs",
    "\n",
    "\n",
    "        # get symbolic batch-size / seq length. Also don't forget self.decoder_num_units above",
    "\n",
    "        batch_size, seq_length, _ = tuple(encoder_seq.shape)",
    "\n",
    "\n",
    "        # here's a recommended step-by-step guide for attention mechanism.",
    "\n",
    "        # You are free to ignore it alltogether if you so wish",
    "\n",
    "\n",
    "        # we repeat decoder activations to allign with encoder",
    "\n",
    "        decoder_h_repeated = <cast decoder_h into[batch, seq_length, decoer_num_units] by",
    "\n",
    "                              repeating it _seq_length_ times >",
    "\n",
    "                             <use T.repeat and maybe some reshape>",
    "\n",
    "        # ^--shape=[batch,seq_length,decoder_n_units]",
    "\n",
    "        ",
    "\n",
    "        encoder_and_decoder_together = <concatenate repeated decoder and encoder over last axis>",
    "\n",
    "        # ^--shape=[batch,seq_length,enc_n_units+dec_n_units]",
    "\n",
    "        ",
    "\n",
    "        # here we flatten the tensor to simplify",
    "\n",
    "        encoder_and_decoder_flat = T.reshape(encoder_and_decoder_together,(-1,encoder_and_decoder_together.shape[-1]))",
    "\n",
    "        # ^--shape=[batch*seq_length,enc_n_units+dec_n_units]",
    "\n",
    "        ",
    "\n",
    "        # here you use encoder_and_decoder_flat and some learned weights to predict attention logits",
    "\n",
    "        # don't use softmax yet",
    "\n",
    "        <your code here>",
    "\n",
    "        attention_logits_flat = <logits to be used as attention weights>",
    "\n",
    "        # ^--shape=[batch*seq_length,1]",
    "\n",
    "        ",
    "\n",
    "        ",
    "\n",
    "        # here we reshape flat logits back into correct form",
    "\n",
    "        assert attention_logits_flat.ndim==2",
    "\n",
    "        attention_logits = attention_logits_flat.reshape((batch_size,seq_length))",
    "\n",
    "        # ^--shape=[batch,seq_length]",
    "\n",
    "        ",
    "\n",
    "        # here we apply softmax :)",
    "\n",
    "        attention = T.nnet.softmax(attention_logits)",
    "\n",
    "        # ^--shape=[batch,seq_length]",
    "\n",
    "        ",
    "\n",
    "        # here we compute output",
    "\n",
    "        output = (attention[:,:,None]*encoder_seq).sum(axis=1) #sum over seq_length",
    "\n",
    "        # ^--shape=[batch,enc_n_units]",
    "\n",
    "        ",
    "\n",
    "        return output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# demo code",
    "\n",
    "\n",
    "from numpy.random import randn",
    "\n",
    "\n",
    "dec_h_prev = InputLayer((None, 50), T.constant(",
    "\n",
    "    randn(5, 50)), name='decoder h mock')",
    "\n",
    "\n",
    "enc = InputLayer((None, None, 32), T.constant(",
    "\n",
    "    randn(5, 20, 32)), name='encoder sequence mock')",
    "\n",
    "\n",
    "attention = AttentionLayer(dec_h_prev, enc)",
    "\n",
    "\n",
    "# now you can use attention as additonal input to your decoder",
    "\n",
    "# LSTMCell(prev_cell,prev_out,input_or_inputs=(usual_input,attention))",
    "\n",
    "\n",
    "\n",
    "# sanity check",
    "\n",
    "demo_output = get_output(attention).eval()",
    "\n",
    "print 'actual shape:', demo_output.shape",
    "\n",
    "assert demo_output.shape == (5, 32)",
    "\n",
    "assert np.isfinite(demo_output)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
